For the purposes of this Project the following variables and definitions are important.
The population in this dataset is the incoming cohort of students in Fall of 2019 and 2020. These students are first time degree or certificate seekers and have no prior tertiary education. They may have earned AP credits in highschool.
Fall2019 refers to the incoming freshman cohort in Fall2019. This is term year 2020.
Fall2020 refers to the incoming freshman cohort in Fall2020. This is term year 2021.
Variables of Interest: term year Incoming students in Fall2019 are assigned to term year 2020. Incoming students in Fall 2020 are assigned to term year 2021.
hours_earned: refers to credit hours the student has earned in their first Fall semester ( this can include credits earned in Summer school second session- Summer 1 and AP credits earned in high school).
hours_attempted: refers to credit and non credit hours the student has attempted in their first Fall semester ( this may include credits attempted in Summerschool second session - Summer 1).
full_part: is the student full-time (FT) or part-time (PT). Part time students are registered in less than 12 credit hours. Full-time students take at least 12 credits. major: degree programme student is registered for or certificate&LR ( letter of recommendation.) All certificates and letters of recommendations have been grouped together.
hours_earned_rate: Ratio of hours_earned/hours_attempted age: Age of student at start of program.
race: Racial classification of student. sex: Gender classification of student. high_school: Name of highschool student graduted from. Public High schools in Montgomery county are classified as MCPS. pell: Whether the student receives a pell grant or not.
#import data for Fall 2016 to Fall 2020
# set directory
setwd('/Users/ulita/Documents/CAPSTONE205/Capstone/Capstone_Data/Raw_Data')
initial_df_2019<-read_xlsx("FTEIC F19.xlsx")
initial_df_2020<-read_xlsx("FTEIC F20.xlsx")
Read in files from dataMontgomery
#Access list of all highschools in Montgomery County. This will be used to filter highschools in student dataset.
r <- GET('https://data.montgomerycountymd.gov/resource/bfcv-pyzh.json')
#glimpse(r)
#content(r, "text")
jsonRespText <- content(r, as="text")
jsonRespParsed <- content(r, as="parsed")
#jsonRespParsed
df_MC <- fromJSON(jsonRespText)
#head(df_MC)
#class(df_MC)
df_MCPS <- as_tibble(df_MC)
#create dataframe to compare credits in Fall 2019 and Fall 2020 : df_credits
df_initial<- bind_rows(initial_df_2019,initial_df_2020)
head(df_initial)
## # A tibble: 6 x 50
## `U-Number` Sex IPEDS_RACE_DESC `Age at Start` HIGH_SCHOOL FULL_PART CITY
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 20190001 F Hispanic 18 - 20 Out of Sta… PT Gait…
## 2 20190002 M Hispanic 18 - 20 Walter Joh… FT Rock…
## 3 20190003 F Multi-Race 18 - 20 Wheaton Hi… PT Gait…
## 4 20190004 M Asian 18 - 20 Thomas Spr… PT Gait…
## 5 20190005 M Hispanic 18 - 20 FOREIGN / … FT Chev…
## 6 20190006 F Hispanic 18 - 20 Springbroo… FT Silv…
## # … with 43 more variables: STAT_CODE <chr>, ZIP <dbl>, PELL_GRANT <chr>,
## # CAMP_CODE <chr>, Major <chr>, HOURS_ATTEMPTED <chr>, HOURS_EARNED <chr>,
## # MC_GPA <dbl>, PASS_ENGL <chr>, PASS_MATH <chr>, term_year <chr>,
## # TERM_YEAR1_10 <chr>, TERM_YEAR1_20 <chr>, TERM_YEAR1_24 <chr>,
## # TERM_YEAR1_25 <chr>, TERM_YEAR1_30 <chr>, TERM_YEAR1_40 <chr>,
## # TERM_YEAR2_10 <chr>, TERM_YEAR2_20 <chr>, TERM_YEAR2_24 <chr>,
## # TERM_YEAR2_25 <chr>, TERM_YEAR2_30 <chr>, TERM_YEAR2_40 <chr>,
## # TERM_YEAR3_10 <chr>, TERM_YEAR3_20 <chr>, TERM_YEAR3_24 <chr>,
## # TERM_YEAR3_25 <chr>, TERM_YEAR3_30 <chr>, TERM_YEAR3_40 <chr>,
## # TERM_YEAR4_10 <chr>, TERM_YEAR4_20 <chr>, TERM_YEAR4_24 <chr>,
## # TERM_YEAR4_25 <chr>, TERM_YEAR4_30 <chr>, TERM_YEAR4_40 <chr>,
## # TERM_YEAR5_10 <chr>, TERM_YEAR5_20 <chr>, TERM_YEAR5_24 <chr>,
## # TERM_YEAR5_25 <chr>, TERM_YEAR5_30 <chr>, first_grad_date <lgl>,
## # first_degree <lgl>, TERM_YEAR5_40 <chr>
#clean names using janitor package
df_initial<-df_initial %>%
clean_names()
#drop columns not interested in
df_initial<-df_initial %>%
dplyr::select(c(1:20))
# rename columns to make it user friendly
df_initial<-df_initial %>%
rename("age"= "age_at_start", "race"= "ipeds_race_desc", 'summer1'= "term_year1_10", 'fall'='term_year1_20')
# summary statistics of dataframing using function skimr creates a table of summary statistics of all variables in the dataframe
skim(df_initial)
| Name | df_initial |
| Number of rows | 7519 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| sex | 0 | 1.00 | 1 | 1 | 0 | 4 | 0 |
| race | 0 | 1.00 | 5 | 22 | 0 | 9 | 0 |
| age | 0 | 1.00 | 4 | 7 | 0 | 5 | 0 |
| high_school | 0 | 1.00 | 7 | 30 | 0 | 172 | 0 |
| full_part | 0 | 1.00 | 2 | 2 | 0 | 2 | 0 |
| city | 21 | 1.00 | 5 | 19 | 0 | 132 | 0 |
| stat_code | 21 | 1.00 | 2 | 2 | 0 | 16 | 0 |
| pell_grant | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| camp_code | 154 | 0.98 | 1 | 1 | 0 | 6 | 0 |
| major | 33 | 1.00 | 1 | 61 | 0 | 35 | 0 |
| hours_attempted | 87 | 0.99 | 2 | 4 | 0 | 55 | 0 |
| hours_earned | 87 | 0.99 | 2 | 4 | 0 | 56 | 0 |
| pass_engl | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| pass_math | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| term_year | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| summer1 | 50 | 0.99 | 1 | 1 | 0 | 1 | 0 |
| fall | 50 | 0.99 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| u_number | 0 | 1.00 | 20196656.07 | 5028.73 | 20190001 | 20191880.5 | 20193760.0 | 20201708.5 | 20203588 | ▇▃▁▂▇ |
| zip | 21 | 1.00 | 20884.18 | 1518.62 | 1460 | 20853.0 | 20877.0 | 20902.0 | 94025 | ▁▇▁▁▁ |
| mc_gpa | 87 | 0.99 | 2.16 | 1.48 | 0 | 0.5 | 2.5 | 3.5 | 4 | ▆▂▃▅▇ |
There are 18 variables and 7519 rows of data. Each row of data represents 1 student. There are 3931 rows of data for Fall2019 and 3588 rows of data for Fall2020. Need to change the data types for hours_attempted and hours_earned to numeric. u_number should be a charater. Most of the missing values are in camp_code. For the purposes of this exercise, this variable is not important. There are also 87 missing values in hours attempted and hours earned. As these two are the variables of interest and GPA are of interest, I will drop the rows in these variables with missing values.
# Change datatypes
df_initial$hours_attempted<- as.integer(df_initial$hours_attempted)
df_initial$hours_earned<- as.integer(df_initial$hours_earned)
df_initial$mc_gpa<- as.double(df_initial$mc_gpa)
df_initial$u_number<- as.character(df_initial$u_number)
df_initial$zip<- as.character(df_initial$zip)
#drop missing values in hours_attempted, hours_earned and mc_gap
df_initial <- df_initial %>%
drop_na(hours_attempted,hours_earned)
Summarise data again
skim(df_initial)
| Name | df_initial |
| Number of rows | 7432 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| u_number | 0 | 1.00 | 8 | 8 | 0 | 7432 | 0 |
| sex | 0 | 1.00 | 1 | 1 | 0 | 4 | 0 |
| race | 0 | 1.00 | 5 | 22 | 0 | 9 | 0 |
| age | 0 | 1.00 | 4 | 7 | 0 | 5 | 0 |
| high_school | 0 | 1.00 | 7 | 30 | 0 | 171 | 0 |
| full_part | 0 | 1.00 | 2 | 2 | 0 | 2 | 0 |
| city | 20 | 1.00 | 5 | 19 | 0 | 131 | 0 |
| stat_code | 20 | 1.00 | 2 | 2 | 0 | 16 | 0 |
| zip | 20 | 1.00 | 4 | 5 | 0 | 166 | 0 |
| pell_grant | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| camp_code | 152 | 0.98 | 1 | 1 | 0 | 6 | 0 |
| major | 33 | 1.00 | 1 | 61 | 0 | 35 | 0 |
| pass_engl | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| pass_math | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| term_year | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| summer1 | 8 | 1.00 | 1 | 1 | 0 | 1 | 0 |
| fall | 8 | 1.00 | 1 | 1 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hours_attempted | 0 | 1 | 12.29 | 6.27 | 0 | 9.0 | 12.0 | 15.0 | 54 | ▆▇▁▁▁ |
| hours_earned | 0 | 1 | 7.72 | 7.40 | 0 | 3.0 | 6.0 | 12.0 | 54 | ▇▃▁▁▁ |
| mc_gpa | 0 | 1 | 2.16 | 1.48 | 0 | 0.5 | 2.5 | 3.5 | 4 | ▆▂▃▅▇ |
After removing the missing values in hours_earned and hours_attempted. There are 8 missing values in Fall. Closer inspection shows that these 8 students have a GPA of 0 even though there are hours attempted and hours earned. I will remove these students.
# remove students whose Fall term is NA
df_initial <- df_initial%>%
filter(., !is.na(fall))
skim(df_initial)
| Name | df_initial |
| Number of rows | 7424 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| u_number | 0 | 1.00 | 8 | 8 | 0 | 7424 | 0 |
| sex | 0 | 1.00 | 1 | 1 | 0 | 4 | 0 |
| race | 0 | 1.00 | 5 | 22 | 0 | 9 | 0 |
| age | 0 | 1.00 | 4 | 7 | 0 | 5 | 0 |
| high_school | 0 | 1.00 | 7 | 30 | 0 | 170 | 0 |
| full_part | 0 | 1.00 | 2 | 2 | 0 | 2 | 0 |
| city | 20 | 1.00 | 5 | 19 | 0 | 131 | 0 |
| stat_code | 20 | 1.00 | 2 | 2 | 0 | 16 | 0 |
| zip | 20 | 1.00 | 4 | 5 | 0 | 166 | 0 |
| pell_grant | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| camp_code | 152 | 0.98 | 1 | 1 | 0 | 6 | 0 |
| major | 33 | 1.00 | 1 | 61 | 0 | 35 | 0 |
| pass_engl | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| pass_math | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| term_year | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| summer1 | 0 | 1.00 | 1 | 1 | 0 | 1 | 0 |
| fall | 0 | 1.00 | 1 | 1 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hours_attempted | 0 | 1 | 12.29 | 6.27 | 0 | 9.0 | 12.0 | 15.0 | 54 | ▆▇▁▁▁ |
| hours_earned | 0 | 1 | 7.72 | 7.40 | 0 | 3.0 | 6.0 | 12.0 | 54 | ▇▃▁▁▁ |
| mc_gpa | 0 | 1 | 2.16 | 1.48 | 0 | 0.5 | 2.5 | 3.5 | 4 | ▆▂▃▅▇ |
df_initial%>% group_by(term_year,full_part)%>%
count(hours_attempted==0)
## # A tibble: 7 x 4
## # Groups: term_year, full_part [4]
## term_year full_part `hours_attempted == 0` n
## <chr> <chr> <lgl> <int>
## 1 2020 FT FALSE 2251
## 2 2020 FT TRUE 3
## 3 2020 PT FALSE 1606
## 4 2020 PT TRUE 27
## 5 2021 FT FALSE 2147
## 6 2021 PT FALSE 1373
## 7 2021 PT TRUE 17
Term year 2020: There are 3 full time and 27 part time students who attempted zero hours. Term year 2021: There are 17 part time students who attempted zero hours. These 47 observations will be dropped.
# remove students who attempted 0 hours
df_initial <- df_initial%>%
filter(., hours_attempted != 0)
skim(df_initial)
| Name | df_initial |
| Number of rows | 7377 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| u_number | 0 | 1.00 | 8 | 8 | 0 | 7377 | 0 |
| sex | 0 | 1.00 | 1 | 1 | 0 | 4 | 0 |
| race | 0 | 1.00 | 5 | 22 | 0 | 9 | 0 |
| age | 0 | 1.00 | 4 | 7 | 0 | 5 | 0 |
| high_school | 0 | 1.00 | 7 | 30 | 0 | 168 | 0 |
| full_part | 0 | 1.00 | 2 | 2 | 0 | 2 | 0 |
| city | 19 | 1.00 | 5 | 19 | 0 | 130 | 0 |
| stat_code | 19 | 1.00 | 2 | 2 | 0 | 16 | 0 |
| zip | 19 | 1.00 | 4 | 5 | 0 | 164 | 0 |
| pell_grant | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| camp_code | 152 | 0.98 | 1 | 1 | 0 | 6 | 0 |
| major | 33 | 1.00 | 1 | 61 | 0 | 35 | 0 |
| pass_engl | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| pass_math | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
| term_year | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| summer1 | 0 | 1.00 | 1 | 1 | 0 | 1 | 0 |
| fall | 0 | 1.00 | 1 | 1 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hours_attempted | 0 | 1 | 12.36 | 6.21 | 1 | 9.00 | 12.0 | 15.0 | 54 | ▆▇▁▁▁ |
| hours_earned | 0 | 1 | 7.77 | 7.39 | 0 | 3.00 | 6.0 | 12.0 | 54 | ▇▃▁▁▁ |
| mc_gpa | 0 | 1 | 2.18 | 1.48 | 0 | 0.67 | 2.5 | 3.5 | 4 | ▆▂▃▅▇ |
Check missing values in variable major
#check majors list for undeclared and missing values.
df_initial%>% count(major)
## # A tibble: 36 x 2
## major n
## <chr> <int>
## 1 0 16
## 2 American Sign Language 12
## 3 Applied Geography 10
## 4 Architectural Technology 74
## 5 Art 108
## 6 Broadcast Media 29
## 7 Building Trade 88
## 8 Business 879
## 9 Certificate & LR 221
## 10 Communication 68
## # … with 26 more rows
dim(df_initial)
## [1] 7377 20
There are 72 students with undeclared majors. 16 students with majors listed as 0 and 33 listed as NA. Due to privacy reasons, students have been grouped so as not to identify individual students. Degree Major groupings with less than 10 students have been listed as NA or 0. At this point in time I will not remove these data points, because for the purposes of this study, Major names are not required. I will change the NA in major to ‘Field2’ and the 0 to Field1. Students enrolled for a certificate or Letter of Recommendation have been grouped togther irrespective of field.
There are 221 Certificate & LR students in the initial dataset. I will drop these non-degree programmes and focus on degree majors only.
Create dataframe with degrees only. I have retained the majors grouped as NA.
#remove certificates, NA refers to degree majors which have not been named to protect student privacy.
df_Degrees<-subset(df_initial, major!="Certificate & LR")
df_Degrees%>%count(major)
## # A tibble: 34 x 2
## major n
## <chr> <int>
## 1 0 16
## 2 American Sign Language 12
## 3 Applied Geography 10
## 4 Architectural Technology 74
## 5 Art 108
## 6 Broadcast Media 29
## 7 Building Trade 88
## 8 Business 879
## 9 Communication 68
## 10 Computer Applications 81
## # … with 24 more rows
save dataframe Degrees
write.csv(df_Degrees,'df_Degrees.csv')
#count number of full time versus part-time degree students.
df_Degrees%>% group_by(term_year)%>%
count(full_part)
## # A tibble: 4 x 3
## # Groups: term_year [2]
## term_year full_part n
## <chr> <chr> <int>
## 1 2020 FT 2222
## 2 2020 PT 1527
## 3 2021 FT 2088
## 4 2021 PT 1286
# change in overall student population from 2020 to 2021
df_Degrees%>%
group_by(term_year,full_part)%>%
count(full_part)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(full_part)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 4 x 5
## # Groups: full_part [2]
## term_year full_part n total_pop pct_change
## <chr> <chr> <int> <int> <dbl>
## 1 2020 FT 2222 3749 NA
## 2 2021 FT 2088 3374 -6.03
## 3 2020 PT 1527 3749 NA
## 4 2021 PT 1286 3374 -15.8
# decrease in total student population
((2251+1606)-(2147+1373))/(2251+1606)*100
## [1] 8.737361
Degree Seeking students: There was an 8.7% decrease in the incoming cohort ( part time and full time) size from 2020 to 2021. In term year 2020, there were 2222 full time degree seeking students and 1527 part time degree seeking students. In term year 2021, there were 2088 full time degree seeking students and 1286 part time degree seeking students. There is a 6 % drop in full time enrollment from 2020 to 2021. There is a 15.78 % drop in part time enrollment from 2020 to 2021.
Frequency of Students Part time versus Full tim: 2020 vs 2021
# Number of students part time abnd full time 2020 vs 2021
ggplot(data=df_Degrees, aes(x=full_part, fill=full_part)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=2,size=3)+
facet_wrap(~term_year)+
ggtitle("Number of Students Full time versus Part time")+
ylab('Frequency')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
Proportion of Students Full time versus Part time: 2020 vs 2021
df_Degrees %>%
group_by(term_year) %>%
count(full_part) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = full_part, y = prop)) +
geom_col(aes(fill = full_part), position = "dodge") +
geom_text(aes(label = scales::percent(prop),
y = prop,
group = full_part),
position = position_dodge(width = 0.9),
vjust = 2,size=3)+
facet_wrap(~term_year)+
ggtitle("Proportion of Students Full time versus Part time")+
ylab('Percentage')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
Approximately 3/5 of students in each term year are full time and 2/5 are part time.
Count of Race Groups
ggplot(data=df_Degrees, aes(x=race, fill=race)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0,size=3)+
facet_wrap(~term_year + full_part)+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())+
ggtitle("Number of Students per a Race Group")+
xlab("Race")+
ylab("Frequency")
Percentage of each race group in Student Population
df_Degrees %>%
group_by(term_year,full_part) %>%
count(race) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = race, y = prop)) +
geom_col(aes(fill = race), position = "dodge") +
geom_text(aes(label = scales::percent(prop,0.1),
y = prop,
group = race),
position = position_dodge(width = 0.9),
vjust = 0,size=3)+
facet_wrap(~term_year + full_part)+
ggtitle("Race Proportion of Students: Full time versus Part time")+
ylab('Proportion ')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
Hispanic students form the largest proportion in the full time and part time student populations, followed by African American, White, Asian and Foreign students.
df_Degrees%>%
filter(full_part=='FT')%>%
group_by(race,term_year)%>%
tally()
## # A tibble: 18 x 3
## # Groups: race [9]
## race term_year n
## <chr> <chr> <int>
## 1 Am. Indian / AK Native 2020 6
## 2 Am. Indian / AK Native 2021 4
## 3 Asian 2020 318
## 4 Asian 2021 274
## 5 Black / African Am. 2020 569
## 6 Black / African Am. 2021 496
## 7 Foreign 2020 212
## 8 Foreign 2021 176
## 9 Hawaiian / Pac. Isl. 2020 6
## 10 Hawaiian / Pac. Isl. 2021 3
## 11 Hispanic 2020 653
## 12 Hispanic 2021 722
## 13 Multi-Race 2020 81
## 14 Multi-Race 2021 77
## 15 Unknown 2020 19
## 16 Unknown 2021 5
## 17 White 2020 358
## 18 White 2021 331
Full time student: Change in enrollment from 2020 to 2021 based on Race
# calculate percentage change in full time student enrollment from 2020 to 2021 by race
df_Degrees%>%
filter(full_part=="FT")%>%
group_by(term_year,race)%>%
count(race)%>%
group_by(race)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 18 x 4
## # Groups: race [9]
## term_year race n pct_change
## <chr> <chr> <int> <dbl>
## 1 2020 Am. Indian / AK Native 6 NA
## 2 2021 Am. Indian / AK Native 4 -33.3
## 3 2020 Asian 318 NA
## 4 2021 Asian 274 -13.8
## 5 2020 Black / African Am. 569 NA
## 6 2021 Black / African Am. 496 -12.8
## 7 2020 Foreign 212 NA
## 8 2021 Foreign 176 -17.0
## 9 2020 Hawaiian / Pac. Isl. 6 NA
## 10 2021 Hawaiian / Pac. Isl. 3 -50
## 11 2020 Hispanic 653 NA
## 12 2021 Hispanic 722 10.6
## 13 2020 Multi-Race 81 NA
## 14 2021 Multi-Race 77 -4.94
## 15 2020 Unknown 19 NA
## 16 2021 Unknown 5 -73.7
## 17 2020 White 358 NA
## 18 2021 White 331 -7.54
Full time students: In 2021 there was a 10.57% increase of Hispanic students when compared to 2020.The was a 12.83 % decrease in African American Students, 13.84 % decrease in Asian Students, a 7,54% decrease in white students and a 17.81% decrease in foreign students.
Part time student: Change in enrollment from 2020 to 2021 based on Race
df_Degrees%>%
filter(full_part=='PT')%>%
group_by(race,term_year)%>%
tally()
## # A tibble: 18 x 3
## # Groups: race [9]
## race term_year n
## <chr> <chr> <int>
## 1 Am. Indian / AK Native 2020 6
## 2 Am. Indian / AK Native 2021 2
## 3 Asian 2020 132
## 4 Asian 2021 101
## 5 Black / African Am. 2020 404
## 6 Black / African Am. 2021 352
## 7 Foreign 2020 167
## 8 Foreign 2021 101
## 9 Hawaiian / Pac. Isl. 2020 2
## 10 Hawaiian / Pac. Isl. 2021 4
## 11 Hispanic 2020 491
## 12 Hispanic 2021 396
## 13 Multi-Race 2020 51
## 14 Multi-Race 2021 51
## 15 Unknown 2020 16
## 16 Unknown 2021 4
## 17 White 2020 258
## 18 White 2021 275
Part time student: Change in enrollment from 2020 to 2021 based on Race
# calculate percentage change in part time student enrollment from 2020 to 2021 by race
df_Degrees%>%
filter(full_part=="PT")%>%
group_by(term_year,race)%>%
count(race)%>%
group_by(race)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 18 x 4
## # Groups: race [9]
## term_year race n pct_change
## <chr> <chr> <int> <dbl>
## 1 2020 Am. Indian / AK Native 6 NA
## 2 2021 Am. Indian / AK Native 2 -66.7
## 3 2020 Asian 132 NA
## 4 2021 Asian 101 -23.5
## 5 2020 Black / African Am. 404 NA
## 6 2021 Black / African Am. 352 -12.9
## 7 2020 Foreign 167 NA
## 8 2021 Foreign 101 -39.5
## 9 2020 Hawaiian / Pac. Isl. 2 NA
## 10 2021 Hawaiian / Pac. Isl. 4 100
## 11 2020 Hispanic 491 NA
## 12 2021 Hispanic 396 -19.3
## 13 2020 Multi-Race 51 NA
## 14 2021 Multi-Race 51 0
## 15 2020 Unknown 16 NA
## 16 2021 Unknown 4 -75
## 17 2020 White 258 NA
## 18 2021 White 275 6.59
Part time degree seeking students: In 2021 there was a 19.35% decrease of Hispanic part time degree seeking students when compared to 2020. The was a 12.87 % decrease in African American Students, 23.48 % decrease in Asian Students, a 6.59% increase in white students and a 39.52% decrease in foreign students.
Age distribution of Students Part time versus Full tim: 2020 vs 2021
# Age distribution of students 2020 vs 2021
ggplot(data=df_Degrees, aes(x=age, fill=age)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0,size=3)+
facet_wrap(~term_year+full_part)+
ggtitle("Age Distribution of Students: Full time versus Part time")+
ylab('Frequency')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
Age distribution of Full time Students
df_Degrees%>%
filter(full_part=="FT")%>%
group_by(term_year,age)%>%
count(age)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(age)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_pop= (n/total_pop*100))
## # A tibble: 10 x 5
## # Groups: age [5]
## term_year age n total_pop pct_pop
## <chr> <chr> <int> <int> <dbl>
## 1 2020 < 18 178 2222 8.01
## 2 2021 < 18 167 2088 8.00
## 3 2020 18 - 20 1878 2222 84.5
## 4 2021 18 - 20 1776 2088 85.1
## 5 2020 21 - 24 77 2222 3.47
## 6 2021 21 - 24 62 2088 2.97
## 7 2020 25 - 29 41 2222 1.85
## 8 2021 25 - 29 37 2088 1.77
## 9 2020 30 + 48 2222 2.16
## 10 2021 30 + 46 2088 2.20
In term year 2020 and term year 2021 ,approximately 85% of the full time and part time students are between 18-20yrs. In term year 2020, 66.86% of the part time students were between 18-20yrs and in term year 2021, 70.68% students.
Age distribution of Part time Students
df_Degrees%>%
filter(full_part=="PT")%>%
group_by(term_year,age)%>%
count(age)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(age)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_pop= (n/total_pop*100))
## # A tibble: 10 x 5
## # Groups: age [5]
## term_year age n total_pop pct_pop
## <chr> <chr> <int> <int> <dbl>
## 1 2020 < 18 74 1527 4.85
## 2 2021 < 18 65 1286 5.05
## 3 2020 18 - 20 1021 1527 66.9
## 4 2021 18 - 20 909 1286 70.7
## 5 2020 21 - 24 149 1527 9.76
## 6 2021 21 - 24 108 1286 8.40
## 7 2020 25 - 29 99 1527 6.48
## 8 2021 25 - 29 69 1286 5.37
## 9 2020 30 + 184 1527 12.0
## 10 2021 30 + 135 1286 10.5
The part time student population in term year 2020 and term year 2021 has a higher proportion of 21-24, 25-29 and 30+ and a lower proportion of under 18s than the full time student population.
Gender of Students
# Gender of students part time and full time 2020 vs 2021
ggplot(data=df_Degrees, aes(x=sex, fill=sex)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=1,size=3)+
facet_wrap(~term_year+full_part)+
ggtitle("Gender of Students: Full time versus Part time")+
ylab('Frequency')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
Gender 0 and Gender N are unknown values, however they have not been removed from the dataset as the variables of interest are hours_attempted and hours_earned. Some students choose not to report a gender or are transgender.This is approximately 1% of the student population.
Percentage of Full time Students by Gender
df_Degrees%>%
filter(full_part=="FT")%>%
group_by(term_year,sex)%>%
count(sex)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(sex)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_pop= (n/total_pop*100))
## # A tibble: 8 x 5
## # Groups: sex [4]
## term_year sex n total_pop pct_pop
## <chr> <chr> <int> <int> <dbl>
## 1 2020 0 4 2222 0.180
## 2 2021 0 2 2088 0.0958
## 3 2020 F 1068 2222 48.1
## 4 2021 F 1090 2088 52.2
## 5 2020 M 1128 2222 50.8
## 6 2021 M 973 2088 46.6
## 7 2020 N 22 2222 0.990
## 8 2021 N 23 2088 1.10
******CHECK*** In term year 2020, there was approximately 2.58% more full time male students than full time female students. In term year 2021, there was approximately 4.89% more full time female students than full time male students. It is evident that the full time male population have decreased more than the full time female student population in term year 2021.
Percentage of Part time Students by Gender
df_Degrees%>%
filter(full_part=="PT")%>%
group_by(term_year,sex)%>%
count(sex)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(sex)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_pop= (n/total_pop*100))
## # A tibble: 8 x 5
## # Groups: sex [4]
## term_year sex n total_pop pct_pop
## <chr> <chr> <int> <int> <dbl>
## 1 2020 0 5 1527 0.327
## 2 2021 0 1 1286 0.0778
## 3 2020 F 793 1527 51.9
## 4 2021 F 668 1286 51.9
## 5 2020 M 701 1527 45.9
## 6 2021 M 599 1286 46.6
## 7 2020 N 28 1527 1.83
## 8 2021 N 18 1286 1.40
CHECK
In term year 2020, there was approximately 5.67 % more part time female students than part time male students. In term year 2021, there was approximately 3.42% more part time female students than part time male students.
Calculate percentage change in full time student enrollment from 2020 to 2021 by gender
# calculate percentage change in full time student enrollment from 2020 to 2021 by gender
df_Degrees%>%
filter(full_part=="FT")%>%
filter(sex=="F"|sex =="M")%>%
group_by(term_year,sex)%>%
count(sex)%>%
group_by(sex)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 4 x 4
## # Groups: sex [2]
## term_year sex n pct_change
## <chr> <chr> <int> <dbl>
## 1 2020 F 1068 NA
## 2 2021 F 1090 2.06
## 3 2020 M 1128 NA
## 4 2021 M 973 -13.7
Full time male students: 13.74% decline in registration Full time female students: 2.06% increase in registration
Calculate percentage change in part time student enrollment from 2020 to 2021 by gender
# calculate percentage change in part time student enrollment from 2020 to 2021 by gender
df_Degrees%>%
filter(full_part=="PT")%>%
filter(sex=="F"|sex =="M")%>%
group_by(term_year,sex)%>%
count(sex)%>%
group_by(sex)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 4 x 4
## # Groups: sex [2]
## term_year sex n pct_change
## <chr> <chr> <int> <dbl>
## 1 2020 F 793 NA
## 2 2021 F 668 -15.8
## 3 2020 M 701 NA
## 4 2021 M 599 -14.6
Part time male students: 14.55% decrease in registration Part time female students: 15.76% decrease in registration
Gender and Race breakdown of full time students
# Gender and Race of full time students 2020 vs 2021
df_Degrees%>%
filter(sex %in% c("F","M"))%>%
filter(full_part=="FT")%>%
ggplot(., aes(x=race, fill=race)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0, size=3)+
facet_wrap(~term_year+sex)+
ggtitle("Gender and Race of Full time Students")+
ylab('Frequency')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
# theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
Gender and Race breakdown of part time students
# Gender and Race of part time students 2020 vs 2021
df_Degrees %>%
filter(sex %in% c("F","M"))%>%
filter(full_part=="PT")%>%
ggplot(., aes(x=race, fill=race)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0,size=3)+
facet_wrap(~term_year+sex)+
ggtitle("Gender and Race of Part time Students")+
ylab('Frequency')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
Full time Student Enrollment Percentages trend by Gender and race
# calculate percentage change in student enrollment from 2020 to 2021 by race and gender
# create data frames with counts of full time students by race and gender
df_Degrees%>%
filter(full_part=="FT")%>%
filter(sex=="F"|sex =="M")%>%
group_by(term_year,race,sex)%>%
count(sex)%>%
group_by(race,sex)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 36 x 5
## # Groups: race, sex [18]
## term_year race sex n pct_change
## <chr> <chr> <chr> <int> <dbl>
## 1 2020 Am. Indian / AK Native F 5 NA
## 2 2021 Am. Indian / AK Native F 2 -60
## 3 2020 Am. Indian / AK Native M 1 NA
## 4 2021 Am. Indian / AK Native M 2 100
## 5 2020 Asian F 136 NA
## 6 2021 Asian F 136 0
## 7 2020 Asian M 180 NA
## 8 2021 Asian M 135 -25
## 9 2020 Black / African Am. F 265 NA
## 10 2021 Black / African Am. F 259 -2.26
## # … with 26 more rows
Full time male students: There was a 25% decrease in asian male students, -21.23 % decrease in african american males, a 23.91 percent decrease in foreign males, a 10.84% decrease in white males and a 2.68% increase in Hispanic males. Full time female students: -2.26 % decrease in african american females, a 13.33 percent decrease in foreign females, a 4.61% decrease in white females and a 17.63% increase in Hispanic females.
Part time Student Enrollment Percentages trend by Gender and race
# calculate percentage change in student enrollment from 2020 to 2021 by race and gender
# create data frames with counts of full time students by race and gender
df_Degrees%>%
filter(full_part=="PT")%>%
filter(sex=="F"|sex =="M")%>%
group_by(term_year,race,sex)%>%
count(sex)%>%
group_by(race,sex)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 34 x 5
## # Groups: race, sex [18]
## term_year race sex n pct_change
## <chr> <chr> <chr> <int> <dbl>
## 1 2020 Am. Indian / AK Native F 2 NA
## 2 2020 Am. Indian / AK Native M 4 NA
## 3 2021 Am. Indian / AK Native M 2 -50
## 4 2020 Asian F 69 NA
## 5 2021 Asian F 41 -40.6
## 6 2020 Asian M 59 NA
## 7 2021 Asian M 60 1.69
## 8 2020 Black / African Am. F 205 NA
## 9 2021 Black / African Am. F 188 -8.29
## 10 2020 Black / African Am. M 194 NA
## # … with 24 more rows
# Pell Grant
ggplot(data=df_Degrees, aes(x=pell_grant, fill=pell_grant)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=3, size=3)+
facet_wrap(~term_year+full_part)+
ggtitle("Pell grant")+
ylab('Frequency')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
If the data is correct, no students received pell grants in term year 2021???!!!!
df_Degrees%>%group_by(term_year,full_part) %>%
count(high_school)
## # A tibble: 391 x 4
## # Groups: term_year, full_part [4]
## term_year full_part high_school n
## <chr> <chr> <chr> <int>
## 1 2020 FT ** HOME SCHOOLED ** 15
## 2 2020 FT Academy of Holy Cross 2
## 3 2020 FT Al-Huda School 2
## 4 2020 FT Albert Einstein HS & MC Art Cn 75
## 5 2020 FT Avalon School, The - Bethesda 3
## 6 2020 FT Barrie School - HS Div 1
## 7 2020 FT Bethesda Chevy Chase High Schl 41
## 8 2020 FT Bishop McNamara High School 1
## 9 2020 FT Bladensburg High School 4
## 10 2020 FT Bowie High School 2
## # … with 381 more rows
Filter highschools MCPS
# high school names in student data is listed as xxx High School, whilst in MCPS it is abbreviated to HS.
# Modify MCPS dataframe to change HS to High School
# Modify MSCPS dataframe to match highschool names in MC College Student Data.
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "HS", "High School"))
#Correct discrepanies in names for MCPS highschools.
# Reason: discrepancies in names in MC College data versus Montgomery Data.
# MC College corrections : Bethesda Chevy Chase High Schl, Thomas Sprigg Wootton High Sch, Poolesville Jr-Sr High School,Colonel Zadok Magruder HS,
#Northwest HS - Germantown, Quince Orchard Sr High School,Albert Einstein HS & MC Art Cn, John F. Kennedy High School,Springbrook Sr High School,
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "Bethesda-Chevy Chase High School", "Bethesda Chevy Chase High Schl"))
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "Thomas S Wootton High School", "Thomas Sprigg Wootton High Sch"))
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "Poolesville High School", "Poolesville Jr-Sr High School"))
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "Col Zadok Magruder High School", "Colonel Zadok Magruder HS"))
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "Northwest High School", "Northwest HS - Germantown"))
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "Quince Orchard High School", "Quince Orchard Sr High School"))
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "Albert Einstein High School", "Albert Einstein HS & MC Art Cn"))
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "John F Kennedy High School", "John F. Kennedy High School"))
df_MCPS<-
df_MCPS %>%
mutate(across("school_name", str_replace, "Springbrook High School", "Springbrook Sr High School"))
df_MCPS
## # A tibble: 25 x 15
## school_name zip_code location$latitu… $human_address $needs_recoding
## <chr> <chr> <chr> <chr> <lgl>
## 1 Walter Joh… 20814 39.0253918855 "{\"address\"… FALSE
## 2 Bethesda C… 20814 38.9868264765 "{\"address\"… FALSE
## 3 Walt Whitm… 20817 38.9816307564 "{\"address\"… FALSE
## 4 Poolesvill… 20837 39.1431033716 "{\"address\"… FALSE
## 5 Thomas Spr… 20850 39.0765821359 "{\"address\"… FALSE
## 6 Rockville … 20851 39.0863478113 "{\"address\"… FALSE
## 7 Richard Mo… 20852 39.0772920776 "{\"address\"… FALSE
## 8 Winston Ch… 20854 39.0443047898 "{\"address\"… FALSE
## 9 Colonel Za… 20855 39.1313109818 "{\"address\"… FALSE
## 10 Sherwood H… 20860 39.148342372 "{\"address\"… FALSE
## # … with 15 more rows, and 13 more variables: $longitude <chr>, city <chr>,
## # `:@computed_region_vu5j_pcmz` <chr>, address <chr>,
## # `:@computed_region_d7bw_bq6x` <chr>, `:@computed_region_kbsp_ykn9` <chr>,
## # latitude <chr>, url <chr>, `:@computed_region_tx5f_5em3` <chr>,
## # category <chr>, longitude <chr>, `:@computed_region_rbt8_3x7n` <chr>,
## # phone <chr>
df_Degrees<- df_Degrees %>%
mutate( MCPS = df_Degrees$high_school%in% df_MCPS$school_name)
df_Degrees%>%
group_by(term_year,full_part)%>%
filter(MCPS==TRUE)%>%
count(high_school)
## # A tibble: 100 x 4
## # Groups: term_year, full_part [4]
## term_year full_part high_school n
## <chr> <chr> <chr> <int>
## 1 2020 FT Albert Einstein HS & MC Art Cn 75
## 2 2020 FT Bethesda Chevy Chase High Schl 41
## 3 2020 FT Clarksburg High School 76
## 4 2020 FT Colonel Zadok Magruder HS 76
## 5 2020 FT Damascus High School 41
## 6 2020 FT Gaithersburg High School 133
## 7 2020 FT James Hubert Blake High School 74
## 8 2020 FT John F. Kennedy High School 72
## 9 2020 FT Montgomery Blair High School 106
## 10 2020 FT Northwest HS - Germantown 92
## # … with 90 more rows
create highschool catergories: MCPS , Foreign, Homeschool, Out of State, Other MD ( private schools and MD public schools not in MCPS), DC( District of Columbia)
df_Degrees<- df_Degrees%>%
mutate(HS_classify= case_when(
.$high_school == "** HOME SCHOOLED **" ~ "home_school",
.$high_school == "FOREIGN / INTERNATIONAL HS"~"Foreign",
.$high_school == "G. E. D."~"GED",
.$MCPS == TRUE~"MCPS",
.$high_school=="DISTRICT OF COLUMBIA HS"~"DC",
.$high_school == "Out of State High School"~"Out of State",
TRUE~"Other In State"))
count number of MCPS graduates in cohort in term years 2020 and 2021
df_Degrees%>%group_by(term_year,full_part) %>%
count(MCPS)
## # A tibble: 8 x 4
## # Groups: term_year, full_part [4]
## term_year full_part MCPS n
## <chr> <chr> <lgl> <int>
## 1 2020 FT FALSE 546
## 2 2020 FT TRUE 1676
## 3 2020 PT FALSE 642
## 4 2020 PT TRUE 885
## 5 2021 FT FALSE 512
## 6 2021 FT TRUE 1576
## 7 2021 PT FALSE 460
## 8 2021 PT TRUE 826
df_Degrees%>%group_by(term_year,full_part) %>%
count(HS_classify)
## # A tibble: 28 x 4
## # Groups: term_year, full_part [4]
## term_year full_part HS_classify n
## <chr> <chr> <chr> <int>
## 1 2020 FT DC 69
## 2 2020 FT Foreign 193
## 3 2020 FT GED 33
## 4 2020 FT home_school 15
## 5 2020 FT MCPS 1676
## 6 2020 FT Other In State 174
## 7 2020 FT Out of State 62
## 8 2020 PT DC 55
## 9 2020 PT Foreign 265
## 10 2020 PT GED 54
## # … with 18 more rows
Count of High Schools
ggplot(data=df_Degrees, aes(x=HS_classify, fill=HS_classify)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0,size=3)+
facet_wrap(~term_year + full_part)+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())+
ggtitle("High Schools Students' Graduated ")+
xlab("High School")+
ylab("Frequency")
The majority of full time and part time students are from MCPS highschools
Proportion of Students from Highschools
df_Degrees %>%
group_by(term_year,full_part) %>%
count(HS_classify) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = HS_classify, y = prop)) +
geom_col(aes(fill = HS_classify), position = "dodge") +
geom_text(aes(label = scales::percent(prop,0.5),
y = prop,
group = HS_classify),
position = position_dodge(width = 0.9),
vjust = 0.25,size=3)+
facet_wrap(~term_year + full_part)+
ggtitle("High schools students graduated")+
ylab('Proportion ')+
xlab("")+
theme(axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
75.5% of full time degree students in term year 2020 and term year 2021 graduated from MCPS highschools. 58% of part time degree students in term year 2020 and 64% of part time students in term year 2021 graduated from MCPS highschool.
Breakdown of Highschools Full time students in term year 2020 attended in MCPS
df_Degrees%>%
filter(full_part=="FT" & term_year=="2020")%>%
filter(HS_classify=="MCPS")%>%
group_by(term_year,high_school)%>%
count(high_school)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(high_school)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_pop= (n/total_pop*100))%>%
arrange(desc(pct_pop))
## # A tibble: 25 x 5
## # Groups: high_school [25]
## term_year high_school n total_pop pct_pop
## <chr> <chr> <int> <int> <dbl>
## 1 2020 Gaithersburg High School 133 1676 7.94
## 2 2020 Montgomery Blair High School 106 1676 6.32
## 3 2020 Springbrook Sr High School 93 1676 5.55
## 4 2020 Northwest HS - Germantown 92 1676 5.49
## 5 2020 Paint Branch High School 92 1676 5.49
## 6 2020 Wheaton High School 83 1676 4.95
## 7 2020 Richard Montgomery High School 77 1676 4.59
## 8 2020 Clarksburg High School 76 1676 4.53
## 9 2020 Colonel Zadok Magruder HS 76 1676 4.53
## 10 2020 Albert Einstein HS & MC Art Cn 75 1676 4.47
## # … with 15 more rows
The Graph below gives the overall proportions of full time students from highschools in MCPS in term_year 2020. As there are 25 highschools, I have not included the key. The table above give the percentages of Full time students who attended each high school. In term year 2020: 7.93% of full time students finished high school at Gaithersburg High. 6.32% from Montgomery Blair HS. 5.54% from Springbok Sr High School. 5.49% from Northwest HS - Germantown. 5.49% from Paint Branch High School. 9 highschools had a arrange between 4 and 5% On the other end of the spectrum: Walt Whitman High School at 1.13%. Poolesville Jr-Sr High School 1.31% Winston Churchill 2.03% Thomas Sprigg Wootton High Sch 2.03%
v1<- df_Degrees %>%
group_by(term_year,full_part) %>%
filter(full_part=="FT" & term_year=="2020")%>%
filter(HS_classify=="MCPS")%>%
count(high_school) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = high_school, y = prop)) +
geom_col(aes(fill=high_school), position = "dodge") +
geom_text(aes(label = scales::percent(prop,0.5),
y = prop,
group = high_school),
position = position_dodge(width = 0.9),
vjust = 0, size=3, hjust=0)+
# facet_wrap(~term_year )+
ggtitle("High schools full time students graduated in term year 2020 graduated")+
ylab('Proportion ')+
xlab("")+
theme(legend.position = "none", axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
v1+ coord_flip()
Breakdown of Highschools Full time students in term year 2021 attended in MCPS
df_Degrees%>%
filter(full_part=="FT" & term_year=="2021")%>%
filter(HS_classify=="MCPS")%>%
group_by(term_year,high_school)%>%
count(high_school)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(high_school)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_pop= (n/total_pop*100))%>%
arrange(desc(pct_pop))
## # A tibble: 25 x 5
## # Groups: high_school [25]
## term_year high_school n total_pop pct_pop
## <chr> <chr> <int> <int> <dbl>
## 1 2021 Montgomery Blair High School 99 1576 6.28
## 2 2021 Paint Branch High School 92 1576 5.84
## 3 2021 Gaithersburg High School 91 1576 5.77
## 4 2021 Wheaton High School 90 1576 5.71
## 5 2021 Northwest HS - Germantown 86 1576 5.46
## 6 2021 Colonel Zadok Magruder HS 84 1576 5.33
## 7 2021 Richard Montgomery High School 78 1576 4.95
## 8 2021 Clarksburg High School 75 1576 4.76
## 9 2021 Watkins Mill High School 75 1576 4.76
## 10 2021 John F. Kennedy High School 71 1576 4.51
## # … with 15 more rows
v2<-df_Degrees %>%
group_by(term_year,full_part) %>%
filter(full_part=="FT" & term_year=="2021")%>%
filter(HS_classify=="MCPS")%>%
count(high_school) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = high_school, y = prop)) +
geom_col(aes(fill=high_school), position = "dodge") +
geom_text(aes(label = scales::percent(prop,0.5),
y = prop,
group = high_school),
position = position_dodge(width = 0.9),
vjust = 0, size =3,hjust=0)+
# facet_wrap(~term_year )+
ggtitle("High Schools Full time students in term year 2021 graduated")+
ylab('Proportion ')+
xlab("")+
theme(legend.position = "none", axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
v2+ coord_flip()
# calculate percentage change in full time student enrollment from 2020 to 2021 by MCPS highschool
df_Degrees%>%
filter(full_part=="FT")%>%
filter(HS_classify=="MCPS")%>%
group_by(term_year,high_school)%>%
count(high_school)%>%
group_by(term_year)%>%
group_by(high_school)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 50 x 4
## # Groups: high_school [25]
## term_year high_school n pct_change
## <chr> <chr> <int> <dbl>
## 1 2020 Albert Einstein HS & MC Art Cn 75 NA
## 2 2021 Albert Einstein HS & MC Art Cn 66 -12
## 3 2020 Bethesda Chevy Chase High Schl 41 NA
## 4 2021 Bethesda Chevy Chase High Schl 44 7.32
## 5 2020 Clarksburg High School 76 NA
## 6 2021 Clarksburg High School 75 -1.32
## 7 2020 Colonel Zadok Magruder HS 76 NA
## 8 2021 Colonel Zadok Magruder HS 84 10.5
## 9 2020 Damascus High School 41 NA
## 10 2021 Damascus High School 41 0
## # … with 40 more rows
Full time students: Gaitherburg High had a 31.58% decrease in enrollment. Poolesville Jr-Sr High School 36.36% decrease. Springbrook Sr High School a 31.18% decrease. Walter Johnson High School 28.38% decrease. Rockville High increased enrollment by 40.43%
A more detailed visualisation will be presented on Tableau in EDA.
Highs Schools Part time Students Graduated from. term year 2020 vs term year 2021
Part time students term year 2020:
q1<- df_Degrees %>%
group_by(term_year,full_part) %>%
filter(full_part=="PT" & term_year=="2020")%>%
filter(HS_classify=="MCPS")%>%
count(high_school) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = high_school, y = prop)) +
geom_col(aes(fill=high_school), position = "dodge") +
geom_text(aes(label = scales::percent(prop,0.5),
y = prop,
group = high_school),
position = position_dodge(width = 0.9),
vjust = 0, size=3, hjust=0)+
# facet_wrap(~term_year )+
ggtitle("High schools Part time students graduated in term year 2020 graduated")+
ylab('Proportion ')+
xlab("")+
theme(legend.position = "none", axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
q1+ coord_flip()
Part time students term year 2021:
q2<- df_Degrees %>%
group_by(term_year,full_part) %>%
filter(full_part=="PT" & term_year=="2021")%>%
filter(HS_classify=="MCPS")%>%
count(high_school) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(x = high_school, y = prop)) +
geom_col(aes(fill=high_school), position = "dodge") +
geom_text(aes(label = scales::percent(prop,0.5),
y = prop,
group = high_school),
position = position_dodge(width = 0.9),
vjust = 0, size=3, hjust=0)+
# facet_wrap(~term_year )+
ggtitle("High schools Part time students graduated in term year 2020 graduated")+
ylab('Proportion ')+
xlab("")+
theme(legend.position = "none", axis.text.x=element_blank(),strip.background = element_blank(),panel.grid = element_blank())
q2+ coord_flip()
Breakdown of Highschools Part time students in term year 2020 attended in MCPS
df_Degrees%>%
filter(full_part=="PT" & term_year=="2020")%>%
filter(HS_classify=="MCPS")%>%
group_by(term_year,high_school)%>%
count(high_school)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(high_school)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_pop= (n/total_pop*100))%>%
arrange(desc(pct_pop))
## # A tibble: 25 x 5
## # Groups: high_school [25]
## term_year high_school n total_pop pct_pop
## <chr> <chr> <int> <int> <dbl>
## 1 2020 Northwest HS - Germantown 67 885 7.57
## 2 2020 Gaithersburg High School 63 885 7.12
## 3 2020 Montgomery Blair High School 62 885 7.01
## 4 2020 John F. Kennedy High School 58 885 6.55
## 5 2020 Albert Einstein HS & MC Art Cn 47 885 5.31
## 6 2020 Watkins Mill High School 47 885 5.31
## 7 2020 Clarksburg High School 45 885 5.08
## 8 2020 Paint Branch High School 44 885 4.97
## 9 2020 Richard Montgomery High School 42 885 4.75
## 10 2020 Quince Orchard Sr High School 39 885 4.41
## # … with 15 more rows
Breakdown of Highschools Part time students in term year 2021 attended in MCPS
df_Degrees%>%
filter(full_part=="PT" & term_year=="2021")%>%
filter(HS_classify=="MCPS")%>%
group_by(term_year,high_school)%>%
count(high_school)%>%
group_by(term_year)%>%
mutate(total_pop =sum(n))%>%
group_by(high_school)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_pop= (n/total_pop*100))%>%
arrange(desc(pct_pop))
## # A tibble: 25 x 5
## # Groups: high_school [25]
## term_year high_school n total_pop pct_pop
## <chr> <chr> <int> <int> <dbl>
## 1 2021 Gaithersburg High School 52 826 6.30
## 2 2021 Northwest HS - Germantown 50 826 6.05
## 3 2021 Northwood High School 47 826 5.69
## 4 2021 Montgomery Blair High School 45 826 5.45
## 5 2021 Albert Einstein HS & MC Art Cn 39 826 4.72
## 6 2021 Clarksburg High School 39 826 4.72
## 7 2021 Colonel Zadok Magruder HS 39 826 4.72
## 8 2021 Quince Orchard Sr High School 38 826 4.60
## 9 2021 Paint Branch High School 37 826 4.48
## 10 2021 Springbrook Sr High School 37 826 4.48
## # … with 15 more rows
# calculate percentage change in part time student enrollment from 2020 to 2021 by MCPS highschool
df_Degrees%>%
filter(full_part=="PT")%>%
filter(HS_classify=="MCPS")%>%
group_by(term_year,high_school)%>%
count(high_school)%>%
group_by(term_year)%>%
group_by(high_school)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 50 x 4
## # Groups: high_school [25]
## term_year high_school n pct_change
## <chr> <chr> <int> <dbl>
## 1 2020 Albert Einstein HS & MC Art Cn 47 NA
## 2 2021 Albert Einstein HS & MC Art Cn 39 -17.0
## 3 2020 Bethesda Chevy Chase High Schl 22 NA
## 4 2021 Bethesda Chevy Chase High Schl 15 -31.8
## 5 2020 Clarksburg High School 45 NA
## 6 2021 Clarksburg High School 39 -13.3
## 7 2020 Colonel Zadok Magruder HS 36 NA
## 8 2021 Colonel Zadok Magruder HS 39 8.33
## 9 2020 Damascus High School 20 NA
## 10 2021 Damascus High School 22 10
## # … with 40 more rows
Overall Majors trend
Count of Majors in Full time students in 2020
z1<- df_Degrees%>%
filter(full_part=="FT" &term_year =="2020")%>%
ggplot(., aes(x=major, fill=major)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0, hjust=0, size =3)+
ggtitle("Majors of Full-time Students in 2020 ")+
xlab("Major")+
ylab("Frequency")+
theme(legend.position = "none")
z1 + coord_flip()
Count of Majors in Full time students in 2021
z13<- df_Degrees%>%
filter(full_part=="FT" &term_year =="2021")%>%
ggplot(., aes(x=major, fill=major)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0, hjust=0, size =3)+
ggtitle("Majors of Full-time Students in 2021 ")+
xlab("Major")+
ylab("Frequency")+
theme(legend.position = "none")
z13 + coord_flip()
calculate percentage change in full time student majors from 2020 to 2021
df_Degrees%>%
filter(full_part=="FT")%>%
group_by(term_year,major)%>%
count(major)%>%
group_by(term_year)%>%
group_by(major)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 64 x 4
## # Groups: major [34]
## term_year major n pct_change
## <chr> <chr> <int> <dbl>
## 1 2020 0 3 NA
## 2 2021 0 3 0
## 3 2020 American Sign Language 5 NA
## 4 2021 American Sign Language 1 -80
## 5 2020 Applied Geography 2 NA
## 6 2021 Applied Geography 2 0
## 7 2020 Architectural Technology 18 NA
## 8 2021 Architectural Technology 27 50
## 9 2020 Art 32 NA
## 10 2021 Art 27 -15.6
## # … with 54 more rows
Count of Majors in Part time students in 2020
z11<- df_Degrees%>%
filter(full_part=="PT" &term_year =="2020")%>%
ggplot(., aes(x=major, fill=major)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0, hjust=0, size =3)+
ggtitle("Majors of Part-time Students in 2020 ")+
xlab("Major")+
ylab("Frequency")+
theme(legend.position = "none")
z11 + coord_flip()
Count of Majors in Part time students in 2021
z12<- df_Degrees%>%
filter(full_part=="PT" &term_year =="2021")%>%
ggplot(., aes(x=major, fill=major)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=0, hjust=0, size =3)+
ggtitle("Majors of Part-time Students in 2021 ")+
xlab("Major")+
ylab("Frequency")+
theme(legend.position = "none")
z12 + coord_flip()
calculate percentage change in part time student majors from 2020 to 2021
df_Degrees%>%
filter(full_part=="PT")%>%
group_by(term_year,major)%>%
count(major)%>%
group_by(term_year)%>%
group_by(major)%>%
arrange(term_year,.by_group=TRUE)%>%
mutate(pct_change= (n-lag(n))/lag(n)*100)
## # A tibble: 63 x 4
## # Groups: major [33]
## term_year major n pct_change
## <chr> <chr> <int> <dbl>
## 1 2020 0 9 NA
## 2 2021 0 1 -88.9
## 3 2020 American Sign Language 3 NA
## 4 2021 American Sign Language 3 0
## 5 2020 Applied Geography 4 NA
## 6 2021 Applied Geography 2 -50
## 7 2020 Architectural Technology 21 NA
## 8 2021 Architectural Technology 8 -61.9
## 9 2020 Art 23 NA
## 10 2021 Art 26 13.0
## # … with 53 more rows
Majors by Gender
Majors by Race
Hours attempted in Summer 1 are reported with hours_attempted in Fall. Credits earned in Summer 1 are usually reported with hours earned in Fall.
#count number of full time versus part-time degree students.
df_Degrees%>% group_by(term_year,full_part)%>%
count(summer1=='Y')
## # A tibble: 4 x 4
## # Groups: term_year, full_part [4]
## term_year full_part `summer1 == "Y"` n
## <chr> <chr> <lgl> <int>
## 1 2020 FT FALSE 2222
## 2 2020 PT FALSE 1527
## 3 2021 FT FALSE 2088
## 4 2021 PT FALSE 1286
There are 0 students who have attended Summer1. Hence values reported for hours_earned are only for credits earned in fall and AP credits. Hours attempted in Fall are only for non credit developmental courses and credit courses taken in Fall.
# AP credits can be given for courses taken at highschool.
#Check number of students with hours_earned greater than hours_attempted
df_Degrees %>% count(hours_attempted<hours_earned)
## # A tibble: 2 x 2
## `hours_attempted < hours_earned` n
## <lgl> <int>
## 1 FALSE 7119
## 2 TRUE 4
There are 5 students with attempted hours smaller than hours earned. This may be due to AP credits ( credits earned in high school) and will be addressed later.
Histogram of hours_attempted by year
p <- ggplot(df_Degrees, aes(x = hours_attempted))+ geom_histogram(aes(fill = term_year))
p1<- p + facet_wrap(~full_part)
p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Boxplots of hours_attempted by year
p11 = ggplot(df_Degrees, aes(hours_attempted))
p11 + geom_boxplot(aes(colour = term_year)) +
facet_wrap(~full_part)
Density plot of hours_attempted by year
ggplot(df_Degrees, aes(hours_attempted, fill = term_year)) + geom_density(alpha = 0.2) +
facet_wrap(~full_part)+
xlab("Hours attempted") +
ylab( "Density")
Fivenum Summary
df_Degrees%>% group_by(term_year,full_part)%>%
summarise(n = n(),
min = fivenum(hours_attempted)[1],
Q1 = fivenum(hours_attempted)[2],
median = fivenum(hours_attempted)[3],
Q3 = fivenum(hours_attempted)[4],
max = fivenum(hours_attempted)[5],
mean= mean(hours_attempted),
sd = sd(hours_attempted))
## `summarise()` regrouping output by 'term_year' (override with `.groups` argument)
## # A tibble: 4 x 10
## # Groups: term_year [2]
## term_year full_part n min Q1 median Q3 max mean sd
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 FT 2222 3 12 13 15 53 14.9 5.73
## 2 2020 PT 1527 1 6 8 10 51 8.20 4.82
## 3 2021 FT 2088 3 12 14 16 54 15.4 5.23
## 4 2021 PT 1286 1 6 8 10 47 8.50 4.97
______Code not knitting
d1<-df_Degrees%>% filter(full_part==“FT”)%>% filter(term_year==“2020”)%>% dplyr::select(hours_attempted)
d2<-df_Degrees%>% filter(full_part==“FT”)%>% filter(term_year==“2021”)%>% dplyr::select(hours_attempted)
sband(d1,d2) ______
Hours Attempted by Full-time students: The distributions for the hours attempted by full time students in term year 2020 and term year 2021 have similar shapes. The first quartile for both distributions is 12. The median for term year 2021 (14 hours attempted) is slightly higher than that in 2020 (13 hours attempted). The 3rd quartile for 2020 is 15, whilst in 2021 it is 16. Both distibutions are also skewed right and have similar tail lengths. The maxium for 2020 is 53 and in 2021 it is 54. The distribution for 2021 has lower peaks because there were less students enrolled in 2021 than in 2020. Peaks can be attributed to the fact that most courses are 3 or 4 credit courses. The college offers very few 2 credit courses.
Hours Attempted by Part-time students: The distributions for the hours attempted by part time students in term year 2020 and term year 2021 have similar shapes. The first quartile in 2021 is 5 hours and is slightly lower than 2020 ( 6hours). Both distibutions are skewed right. The right tail for 2021 is slightly shorter. The maxium for 2020 is 51 and in 2021 it is 47. The distribution for 2021 has lower peaks because there were less students enrolled in 2021 than in 2020. Peaks can be attributed to the fact that most courses are 3 or 4 credit courses. The college offers very few 2 credit courses.
Students who attempted less than 3 hours
df_Degrees%>% group_by(term_year,full_part)%>%
count(hours_attempted<3)
## # A tibble: 6 x 4
## # Groups: term_year, full_part [4]
## term_year full_part `hours_attempted < 3` n
## <chr> <chr> <lgl> <int>
## 1 2020 FT FALSE 2222
## 2 2020 PT FALSE 1488
## 3 2020 PT TRUE 39
## 4 2021 FT FALSE 2088
## 5 2021 PT FALSE 1270
## 6 2021 PT TRUE 16
Students who attempted more than 18hours
df_Degrees%>% group_by(term_year,full_part)%>%
count(hours_attempted>18)
## # A tibble: 8 x 4
## # Groups: term_year, full_part [4]
## term_year full_part `hours_attempted > 18` n
## <chr> <chr> <lgl> <int>
## 1 2020 FT FALSE 1898
## 2 2020 FT TRUE 324
## 3 2020 PT FALSE 1473
## 4 2020 PT TRUE 54
## 5 2021 FT FALSE 1798
## 6 2021 FT TRUE 290
## 7 2021 PT FALSE 1241
## 8 2021 PT TRUE 45
41 part time students attempted less than 3 hours in term year 2020 and 16 part time students attempted less than 3 hours in term year 2021. Students also need special permission to enroll in more than 18 credits. In term year 2020, 327 full time and 56 part time students attempted more than 18 hours. In term year 2021, 290 full time and 47 part time degree seeking students attempted more than 18 hours. As only a student who is registered for 12 or more credit hours is considered a full-time student. Students classified as part time and enrolled in more than 12hours, are enrolled in less than 12 credit hours in the Fall semester. As non credit hours and credit hours are included in the hours attempted variable, I will use the hours earned variable to filter the datasets and remove outliers.
Histogram of hours_earned by year
p <- ggplot(df_Degrees, aes(x = hours_earned))+ geom_histogram(aes(fill = term_year))
p1<- p + facet_wrap(~full_part)
p1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Boxplots of hours_earned by year
p11 = ggplot(df_Degrees, aes(hours_earned))
p11 + geom_boxplot(aes(colour = term_year)) +
facet_wrap(~full_part)
Density plot of hours_earned by year
ggplot(df_Degrees, aes(hours_earned, fill = term_year)) + geom_density(alpha = 0.2) +
facet_wrap(~full_part)+
xlab("Hours Earned") +
ylab( "Density")
Hours Earned Fivenum Summary
df_Degrees%>% group_by(term_year,full_part)%>%
summarise(n = n(),
min = fivenum(hours_earned)[1],
Q1 = fivenum(hours_earned)[2],
median = fivenum(hours_earned)[3],
Q3 = fivenum(hours_earned)[4],
max = fivenum(hours_earned)[5],
mean= mean(hours_earned),
sd = sd(hours_earned))
## `summarise()` regrouping output by 'term_year' (override with `.groups` argument)
## # A tibble: 4 x 10
## # Groups: term_year [2]
## term_year full_part n min Q1 median Q3 max mean sd
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 FT 2222 0 6 9 13 53 10.4 7.59
## 2 2020 PT 1527 0 0 3 6 51 3.77 5.33
## 3 2021 FT 2088 0 6 10 13 54 10.3 7.41
## 4 2021 PT 1286 0 0 3 6 47 4.37 5.56
QQplot of hours earned by Fulltime Students.
d1<- df_Degrees%>%filter(full_part=="FT")
p<-qplot(sample = hours_earned, data = d1, color=term_year)
p
Hours Earned Summary The distributions appeared to be skewed to the right. Incoming students are allowed to get AP credits for courses taken at highschool. The distributions for part-time student in 2020 and 2021 appear to follow a similar distribution. Peaks can be attributed to the fact that most courses are 3 or 4 credit courses. The college offers very few 2 credit courses. Students also need special permission to enroll in more than 18credits. Fulltime Students: The first quartile in term year 2020 and term year 2021 is 6 credits. The median for term year 2021 ( 10 hours earned) is slightly higher than the median in 2020 ( 9 hours earned). Both distributions has long tails on the right, with 2020 havinga maximum of 53 and 2021 having a maxium of 54. Part time Students: Both 2020 and 2021 follow a similar distribution. The first quartile for both years is 0. This means that 25% of part time students earn 0 credits. The median for both distributions is 3, ie. 50% of students earn up to 3 credits. The 3rd quartile for both distributions is 6. Both distributions have long tails.
Students who earned less than 3 hours
df_Degrees%>% group_by(term_year,full_part)%>%
count(hours_earned<3)
## # A tibble: 8 x 4
## # Groups: term_year, full_part [4]
## term_year full_part `hours_earned < 3` n
## <chr> <chr> <lgl> <int>
## 1 2020 FT FALSE 1986
## 2 2020 FT TRUE 236
## 3 2020 PT FALSE 830
## 4 2020 PT TRUE 697
## 5 2021 FT FALSE 1813
## 6 2021 FT TRUE 275
## 7 2021 PT FALSE 821
## 8 2021 PT TRUE 465
Hours earned equate to credits earned and is the indication of the progress a student is making towards a degree. In term year 2020, 240 full time and 727 part time students earned less than 3 credits. In term year 2021, 290 full time and 504 part time earned less than 3 credits.
Students who earned more than 18 credits
df_Degrees%>% group_by(term_year,full_part)%>%
count(hours_earned>18)
## # A tibble: 8 x 4
## # Groups: term_year, full_part [4]
## term_year full_part `hours_earned > 18` n
## <chr> <chr> <lgl> <int>
## 1 2020 FT FALSE 1986
## 2 2020 FT TRUE 236
## 3 2020 PT FALSE 1490
## 4 2020 PT TRUE 37
## 5 2021 FT FALSE 1896
## 6 2021 FT TRUE 192
## 7 2021 PT FALSE 1248
## 8 2021 PT TRUE 38
# create variable pass_earned_hours
# hours_earned_rate = hours_earned/hours_attempted
df_Degrees<-df_Degrees %>%
mutate(hours_earned_rate=hours_earned/hours_attempted)
Density plot of Hours Earned Rate by year
ggplot(df_Degrees, aes(hours_earned_rate, fill = term_year)) + geom_density(alpha = 0.3) +
facet_wrap(~full_part)+
xlab("Hours Earned Rate") +
ylab( "Density")+
xlim(0,1)
## Warning: Removed 4 rows containing non-finite values (stat_density).
Unearned hours are attempted hours that the student did not receive credit for.
df_Degrees<-df_Degrees %>%
mutate(unearned_hours = hours_attempted-hours_earned)
Density plot of unearned hours per a year
ggplot(df_Degrees, aes(unearned_hours, fill = term_year)) + geom_density(alpha = 0.3) +
facet_wrap(~full_part)+
xlab("Unearned hours") +
ylab( "Density") +
xlim (0,18)
## Warning: Removed 31 rows containing non-finite values (stat_density).
Histogram of GPA by year
p <- ggplot(df_Degrees, aes(x = mc_gpa))+ geom_histogram(aes(fill = term_year))
p7<- p + facet_wrap(~full_part)
p7
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Boxplots of GPA by year
p15 = ggplot(df_Degrees, aes(mc_gpa))
p15 + geom_boxplot(aes(colour = term_year)) +
facet_wrap(~full_part)
Density plot of GPA by year
ggplot(df_Degrees, aes(mc_gpa, fill = term_year)) + geom_density(alpha = 0.2) +
facet_wrap(~full_part)+
xlab("GPA") +
ylab( "Density")
GPA Fivenum Summary
df_Degrees%>% group_by(term_year,full_part)%>%
summarise(n = n(),
min = fivenum(mc_gpa)[1],
Q1 = fivenum(mc_gpa)[2],
median = fivenum(mc_gpa)[3],
Q3 = fivenum(mc_gpa)[4],
max = fivenum(mc_gpa)[5],
mean= mean(mc_gpa),
sd = sd(mc_gpa))
## `summarise()` regrouping output by 'term_year' (override with `.groups` argument)
## # A tibble: 4 x 10
## # Groups: term_year [2]
## term_year full_part n min Q1 median Q3 max mean sd
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 FT 2222 0 1.69 2.92 3.57 4 2.53 1.26
## 2 2020 PT 1527 0 0 1.63 3 4 1.67 1.60
## 3 2021 FT 2088 0 1.5 2.85 3.5 4 2.43 1.34
## 4 2021 PT 1286 0 0 2 3.33 4 1.81 1.59
library(GGally)
# plot distributions and correlation of variables
df_Degrees%>% filter(term_year=="2020")%>%
filter(full_part=="FT")%>%
ggpairs(., columns = c("hours_attempted","hours_earned", "mc_gpa","hours_earned_rate","unearned_hours"))